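Foundations of the Hierarchy
The memory hierarchy rests on a trade-off between static random-access memory (SRAM) and dynamic random-access memory (DRAM). SRAM stores each bit in a bistable memory cell built from six transistors. Picture an inverted pendulum: it is stable when tipped fully to either side, while the balanced middle position is only metastable. This bistability makes SRAM fast, expensive, and resistant to disturbances. DRAM, by contrast, stores each bit as charge on a tiny capacitor (about 30 × 10⁻¹⁵ farads). DRAM is slower than SRAM, and because the charge leaks away, it must be refreshed continually.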
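DRAM Organization and Bus Transactions
To reduce the pin count, the bits of a DRAM chip are partitioned into $d$ supercells arranged as an $r \times c$ grid, where $rc = d$. Accessing data takes two steps: the memory controller sends a RAS (row access strobe) request, which copies an entire row into the row buffer, and then a CAS (column access strobe) request, which selects one supercell from that buffer. This is one reason sumarraycols is inherently slower: it repeatedly misses the row buffer.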
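The two-step RAS/CAS access can be sketched in a few lines of Python. This is a toy model, not a description of any real controller: the 4 x 4 grid, the single open row buffer, and the helper names `split_address` and `count_row_activations` are all assumptions made for illustration.

```python
def split_address(supercell: int, cols: int) -> tuple[int, int]:
    """Split a linear supercell index into (row, col) for a grid with `cols` columns."""
    return divmod(supercell, cols)


def count_row_activations(supercells, cols: int) -> int:
    """Count the RAS requests needed for a sequence of accesses.

    Assumption: one row buffer that holds the most recently opened row.
    A CAS alone is enough whenever the requested row is already open.
    """
    open_row = None
    activations = 0
    for s in supercells:
        row, _col = split_address(s, cols)
        if row != open_row:  # row-buffer miss: a new RAS is required
            activations += 1
            open_row = row
    return activations


ROWS, COLS = 4, 4  # hypothetical 16-supercell DRAM

# Visit every supercell once, first row by row, then column by column.
row_order = [r * COLS + c for r in range(ROWS) for c in range(COLS)]
col_order = [r * COLS + c for c in range(COLS) for r in range(ROWS)]

row_major_ras = count_row_activations(row_order, COLS)
col_major_ras = count_row_activations(col_order, COLS)
print(row_major_ras, col_major_ras)  # 4 16
```

Under these assumptions the row-wise walk opens each row once, while the column-wise walk reopens a row on every access.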
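Data Transfer
Data moves in bus transactions across the system bus and the memory bus, which are connected by the I/O bridge. A movq A, %rax instruction (a read transaction) causes the bridge to translate the CPU's request into the row and column signals of the DRAM grid.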
QUESTION 1
Which physical characteristic explains why SRAM is faster but less dense than DRAM?
SRAM uses capacitors that require periodic refreshing.
SRAM uses a 6-transistor bistable cell, while DRAM uses a single transistor and capacitor.
DRAM uses the inverted pendulum principle for stability.
SRAM requires RAS/CAS strobing for every bit access.
✅ Correct!
SRAM's bistable state is maintained by 6 transistors, whereas DRAM's 1-transistor/1-capacitor design allows for much higher density at the cost of speed and the need for periodic refresh.
❌ Incorrect
Capacitors are characteristic of DRAM, not SRAM. SRAM's speed comes from its transistor-only bistable design.
QUESTION 2
In a 128 x 8 DRAM, why does the controller send the Row Address (RAS) and Column Address (CAS) separately?
To increase the power consumption defined by P = fCV².
To allow the CPU to perform polynomial evaluation in between.
To reduce the number of address pins required on the chip by minimizing max(br, bc).
To ensure the firmware can intercept the memory bus transaction.
✅ Correct!
Multiplexing the address into row and column parts lets the chip reuse the same pins for both, so it needs only max(br, bc) address pins instead of the full br + bc.
❌ Incorrect
Separating RAS and CAS is a pin-count optimization, not a performance or power-saving strategy.
QUESTION 3
Identify the sequence for a 'Read transaction' of 'movq A, %rax'.
CPU places address on System Bus -> I/O Bridge translates to Memory Bus -> DRAM returns data to CPU.
DRAM sends data to System Bus -> I/O Bridge sends to Register file.
CPU places data on Memory Bus -> Bridge translates to System Bus -> Address A is updated.
Register %rax moves to I/O Bridge -> System Bus -> Main Memory.
✅ Correct!
In a read, the address flows from CPU to memory, and the data flows from memory back to the CPU via the bridge.
❌ Incorrect
This describes a write transaction or a physically impossible flow. Data flows in response to an address request.
QUESTION 4
Match the address partitioning component: CI
The cache block offset
The cache set index
The cache tag
The DRAM supercell width
✅ Correct!
CI stands for Cache Set Index. CO is the Offset and CT is the Tag.
❌ Incorrect
Review the partitioning: CT (Tag), CI (Index), CO (Offset).
QUESTION 5
What would the hit rate be if the cache were twice as big for a grid array scan problem with a 64 B block size?
It would double exactly.
It depends on spatial locality; for a stride-1 scan, it remains (BlockSize - sizeof(type)) / BlockSize.
It would become 100% because of compulsory misses.
It would decrease because of conflict misses.
✅ Correct!
For simple sequential scans, increasing cache size doesn't necessarily improve the hit rate if the bottleneck is spatial locality within blocks.
❌ Incorrect
Hit rates for sequential scans are limited by the block size and first-time access (cold/compulsory misses).
DRAM Dimension Optimization & Stride Analysis
Physical Layout vs. Algorithmic Access
Consider a system using a 512 x 4 DRAM module. The physical organization requires minimizing the pin count while the software executes a matrix transpose.
Q
Determine the power-of-2 array dimensions (r, c) that minimize max(br, bc) for a 512 x 4 DRAM.
Solution:
For a 512 x 4 DRAM, the total number of supercells is d = 512. To minimize max(br, bc), the grid should be as close to square as possible. Since 512 is not a perfect square, we look for powers of 2. $2^9$ = 512. We can use r = 32 ($2^5$) and c = 16 ($2^4$), or vice-versa. Here, br = 5 and bc = 4. Thus, max(br, bc) = 5. The dimensions are 32 x 16.
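The same reasoning can be checked with a short Python sketch. The function name `best_dimensions` is hypothetical, and the convention of giving the extra address bit to the rows (so r >= c) is an assumption; swapping the roles gives an equally valid answer.

```python
def best_dimensions(d: int) -> tuple[int, int]:
    """Return power-of-2 dimensions (r, c) with r * c == d that minimize max(br, bc).

    Assumption: when the number of address bits is odd, the extra bit
    goes to the row address, so r >= c.
    """
    if d < 1:
        raise ValueError("d must be a positive power of 2")
    total_bits = d.bit_length() - 1  # log2(d) when d is a power of 2
    if 1 << total_bits != d:
        raise ValueError("d must be a positive power of 2")
    bc = total_bits // 2       # column address bits
    br = total_bits - bc       # row address bits (the larger half)
    return 1 << br, 1 << bc


print(best_dimensions(512))  # (32, 16)
print(best_dimensions(128))  # (16, 8)
```

Splitting the address bits as evenly as possible is what keeps max(br, bc) at its minimum, here 5 bits for d = 512.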
Q
How does the DRAM row-buffer (RAS/CAS) mechanism impact the performance of 'dst[j*dim + i] = src[i*dim + j]' when 'dim' is a large power of 2?
Solution:
When 'dim' is a large power of 2, 'src' is read in row-major order (good spatial locality, row-buffer hits), but 'dst' is written in column-major order, with a stride of 'dim' elements. As a result nearly every write needs its own RAS request: consecutive 'dst' accesses likely map to different DRAM rows, forcing the memory controller to keep closing and opening rows (thrashing the row buffer).
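A rough Python model makes the difference between the two access streams concrete. Everything numeric here is an assumption chosen for illustration: 4-byte elements, a 256-byte DRAM row, a 64 x 64 matrix, and one private row buffer per array (as if 'src' and 'dst' lived in separate banks). CPU caches are ignored entirely, so this isolates the row-buffer effect only.

```python
def row_activations(indices, elem_size: int, row_bytes: int) -> int:
    """Count RAS requests for a stream of element indices into one array.

    Assumption: the array starts at a DRAM row boundary and has its own
    row buffer holding the most recently opened row.
    """
    open_row = None
    count = 0
    for i in indices:
        row = (i * elem_size) // row_bytes
        if row != open_row:  # row-buffer miss
            count += 1
            open_row = row
    return count


def transpose_streams(dim: int):
    """Element indices touched by dst[j*dim + i] = src[i*dim + j]."""
    src, dst = [], []
    for i in range(dim):
        for j in range(dim):
            src.append(i * dim + j)
            dst.append(j * dim + i)
    return src, dst


DIM, ELEM_SIZE, ROW_BYTES = 64, 4, 256  # hypothetical sizes

src_idx, dst_idx = transpose_streams(DIM)
src_ras = row_activations(src_idx, ELEM_SIZE, ROW_BYTES)
dst_ras = row_activations(dst_idx, ELEM_SIZE, ROW_BYTES)
print(src_ras, dst_ras)  # 64 4096
```

With these sizes each matrix row fills exactly one DRAM row, so the reads open 64 rows in total while every one of the 4096 writes lands in a different row from the write before it.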